Time-series text visualization
==============================

**Arabica** provides n-gram visualization methods to describe a dataset and to reveal how it varies over time.

The **cappuccino** method performs standard cleaning operations and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization. It automatically removes punctuation from the input (using cleantext). It can also apply all, or any combination, of the following cleaning operations:

* Remove digits from the text
* Remove standard list(s) of stop words (using NLTK)
* Remove an additional, user-specified list of words

**Stop words** are generally the most common words in a language that carry little meaning on their own, such as *"is"*, *"am"*, *"the"*, *"this"*, *"are"*, etc. They are often filtered out because they add little or no information value. Arabica supports stop word removal for the languages in the NLTK stop word corpus. To print all available languages:

.. code-block:: python
    :linenos:

    from nltk.corpus import stopwords

    # List the languages for which NLTK ships a stop word list
    print(stopwords.fileids())

Several sets of stop words can be removed at once by passing a list of languages:

.. code-block:: python
    :linenos:

    # Each entry is a language name from stopwords.fileids()
    stopwords = ['english', 'french', 'etc..']

-----------------------------------------

:doc:`Word cloud` is a graphical representation of word importance (typically frequency) that gives greater prominence to words that appear more often in a source text.

:doc:`Heatmap` visualizes n-grams through time. It divides the data into discrete categories (boxes) by time period and assigns a color to each box based on the value of the n-gram, typically its frequency in that period.

:doc:`Line plot` displays n-grams as a series of data points, called markers, connected by straight line segments. It is a basic chart type common in many fields.
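
The cleaning operations described earlier, and the per-period n-gram counts that a heatmap or line plot would display, can be sketched in plain Python. This is only an illustration under stated assumptions, not Arabica's implementation: the helper names (``clean``, ``ngrams``, ``ngram_counts_by_period``), the tiny ``STOP_WORDS`` set, and the sample records are all hypothetical, and the period is simply the year taken from a ``YYYY-MM-DD`` string.

.. code-block:: python

    import string
    from collections import Counter

    # Hypothetical mini stop word list, for illustration only
    # (Arabica relies on the NLTK stop word lists instead).
    STOP_WORDS = {"is", "am", "the", "this", "are", "a"}

    def clean(text, skip_digits=True, stop_words=STOP_WORDS):
        """Lowercase, strip punctuation, optionally drop digits and stop words."""
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        if skip_digits:
            text = text.translate(str.maketrans("", "", string.digits))
        return [tok for tok in text.split() if tok not in stop_words]

    def ngrams(tokens, n):
        """Return the n-grams of a token list as space-joined strings."""
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def ngram_counts_by_period(records, n=1):
        """Map each period (the year of a 'YYYY-MM-DD' date) to n-gram counts."""
        counts = {}
        for date, text in records:
            period = date[:4]  # assumption: ISO-formatted date strings
            counts.setdefault(period, Counter()).update(ngrams(clean(text), n))
        return counts

    # Hypothetical sample data: (date, text) pairs
    records = [
        ("2021-03-01", "The coffee is great, the service is slow."),
        ("2021-07-15", "Great coffee!"),
        ("2022-01-10", "This coffee is 2 expensive."),
    ]
    counts = ngram_counts_by_period(records, n=1)

Each ``Counter`` in ``counts`` corresponds to one time column of a heatmap, or one x-axis position of a line plot, with the n-gram frequencies as the plotted values. Note that this sketch forms n-grams after stop words are removed, so words that were separated by a stop word become adjacent; whether that is desirable depends on the analysis.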